Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding provides applications, ranging from crawlers and meta-search engines to service integrators, with a key to this content. Yet, it has received little attention other than as a component in specific applications such as crawlers or meta-search engines. No comprehensive approach to form understanding exists, let alone one that produces rich models for semantic services or integration with linked open data. In this paper, we present OPAL, the first comprehensive approach to form understanding and integration. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems, OPAL pushes the state of the art. For form labeling, it combines features from the text, structure, and visual rendering of a web page. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms, OPAL outperforms previous approaches for form labeling by a significant margin. For form interpretation, OPAL uses a schema (or ontology) of forms in a given domain. Thanks to this domain schema, it is able to produce nearly perfect form interpretations (more than 97 percent accuracy in the evaluation domains). Yet, the effort to produce a domain schema is very low, as we provide a Datalog-based template language that eases the specification of such schemata and a methodology for deriving a domain schema largely automatically from an existing domain ontology. We demonstrate the value of the form interpretations in OPAL through a light-weight form integration system that successfully translates and distributes master queries to hundreds of forms with no error, yet is implemented with only a handful of translation rules.